ABSTRACT

This paper explores novel approaches for improving the spatial
codification for the pooling of local descriptors to solve the
semantic segmentation problem. We propose to partition the image
into three regions for each object to be described: Figure, Border
and Ground. This partition aims at minimizing the influence of the
image context on the object description and vice versa by
introducing an intermediate zone around the object contour.
Furthermore, we also propose a richer visual descriptor of the
object by applying a Spatial Pyramid over the Figure region. Two
novel Spatial Pyramid configurations are explored: Cartesian-based
and crown-based Spatial Pyramids. We test these approaches with
state-of-the-art techniques and show that they improve the
Figure-Ground based pooling in the Pascal VOC 2011 and 2012
semantic segmentation challenges.

Index Terms: Semantic segmentation, object recognition, object
segmentation, spatial codification

1. INTRODUCTION 

The classic approach to labelling the regions of an image with the
appropriate object class has commonly been based on SIFT-like [1]
and HOG-like [2] features, pooled within each region using
Bag-of-Features (BoF) [3, 4, 5] or, more recently, Second Order
Pooling (O2P) techniques [6, 7]. In addition, approaches based on
convolutional neural networks (CNNs) have gained popularity in the
scientific community thanks to the results achieved by works such
as [8], [9] and [10]. However, CNNs need to be pre-trained on large
databases such as ImageNet Classification (1.2 million annotated
images). In this paper, we investigate an alternative approach where
features are manually designed instead of automatically learned,
reducing the need for large data collections and costly processing
effort.

Specifically, we propose to improve the visual description by
partitioning the image into three regions (Figure, Border and
Ground), inspired by the work reported by Uijlings et al. in [11].
Multiple authors have highlighted the importance of the spatial
context around an object during its recognition [?]. In our work,
we show the potential of the Figure-Border-Ground (F-B-G) spatial
pooling, extending the work in [11] to the case of real object
candidates and including new features in the visual description.

On the one hand, our proposal has been tested over two
state-of-the-art object candidate algorithms: CPMC [?] and MCG [?].
Introducing the Border pool for object candidates represents a novel
contribution with respect to previous works [?], which only
considered Figure-Ground (F-G) spatial pooling. This intermediate
area aims at minimizing the influence of the image context on the
object description and vice versa, as well as at capturing the rich
contextual information located in the immediate neighbourhood of the
object itself.

Fig. 1. Examples where a richer spatial codification improves the
object segmentation and recognition. Left: images to be semantically
segmented. Middle: solution based on a Figure-Ground spatial
pooling [6]. Right: solution based on a Figure-Border-Ground spatial
pooling.

On the other hand, our work also explores a novel approach for
enriching the visual description of the object. We propose to apply a
contour-based Spatial Pyramid (SP) over the Figure region using two
different configurations: (i) a crown-based SP, where the object is
divided into different crowns for pooling, and (ii) a Cartesian-based
SP, where the object is divided into four geometric quadrants for
pooling. These approaches for a richer spatial codification are
combined with the O2P descriptors [6]. Note that both O2P and BoF
solutions require significantly less training data than CNNs.

In the context of the Pascal VOC challenge named comp5, the simplest
training scenario implies only using the annotations from the
segmentation dataset, discarding the bounding box annotations from
the detection dataset. In that case, our approach improves the
results from [6] with a performance gain of 12.9%. Figure 1 shows
two examples where the proposed richer spatial pooling based on a
F-B-G partition improves both the object segmentation and
recognition with respect to a F-G spatial pooling [6].

The remainder of this paper is structured as follows. Section 2
gives an overview of the related work. In Section 3 we present the
main contributions of our work. Section 4 gives the experimental
results. Finally, conclusions are drawn in Section 5.






2. RELATED WORK 


Our work has been mainly inspired by [11], where Uijlings et al.
investigated the impact of the visual extent of an object on the
Pascal VOC dataset using a BoF with SIFT descriptors. Their analysis
was performed in an ideal situation where the ground truth object
locations are used to create a separate representation with 3 types
of regions: the object's surrounding (Ground), near the object's
contour (Border) and the object's interior (Figure). The authors in
[11] reported a gain of 11.3% in accuracy when introducing the
Border.

The spatial coding of pooled features has not only been addressed
from the perspective of taking automatically generated regions as
reference, but also through an arbitrary partition of the image.
This is the case of the popular Spatial Pyramid (SP) [?], which
consists in dividing the whole image into a grid and pooling the
descriptors over each cell using a BoF framework. To the best of our
knowledge, the works [?] and [11], where an SP is applied over a
bounding box instead of at the image level, are the closest ones to
our contour-based SP. There are also works such as [?], where the
layout of the SP depends on side information like object confidence
maps or visual saliency maps, but it is also applied over the whole
image.

To analyze our approaches for improving the spatial codification in
semantic segmentation in a real context, we have adopted a solution
based on the architecture proposed and released by Carreira et al.
in [6], which is briefly described next. 150 CPMC object candidates
[?] are extracted per image and each object candidate is described
by its Figure and Ground features. Three types of enriched local
features (eSIFT, eMSIFT and eLBP) are densely extracted and pooled
using O2P [6].
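
As a rough illustration of the pooling step, the following sketch
computes a second-order average pooling of the local descriptors that
fall inside one region, followed by a matrix logarithm and a signed
power normalization. It is a simplified reading of O2P rather than
the released implementation: the function name, the regularization
constant `eps` and the exponent `power=0.75` are assumptions made
for this example.

```python
import numpy as np


def second_order_avg_pool(descriptors, eps=1e-6, power=0.75):
    """Pool an (n, d) array of local descriptors into one region vector.

    Hypothetical helper: eps and power are assumed values, not taken
    from the paper.
    """
    x = np.asarray(descriptors, dtype=np.float64)
    n, d = x.shape
    # Second-order average pooling: mean of the outer products x_i x_i^T.
    g = x.T @ x / n
    # Small ridge so that the matrix logarithm is well defined.
    g += eps * np.eye(d)
    # Matrix logarithm through the eigendecomposition of a symmetric matrix.
    w, v = np.linalg.eigh(g)
    w = np.maximum(w, eps)
    log_g = (v * np.log(w)) @ v.T
    # log_g is symmetric, so the upper triangle carries all the information.
    vec = log_g[np.triu_indices(d)]
    # Signed power normalization.
    return np.sign(vec) * np.abs(vec) ** power
```

For descriptors of dimension d the pooled vector has d(d+1)/2
entries, which is why the enriched local features are typically kept
low-dimensional before pooling.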

3. CONTRIBUTIONS 

Our proposal consists of two main contributions: (i) the extension of
the Figure-Border-Ground (F-B-G) pooling with object candidates,
and (ii) a new contour-based Spatial Pyramid (SP) pooling to enrich
the spatial information of the object description.

3.1. F-B-G pooling with object candidates 

In our work, we extend the spatial pooling based on a F-B-G image
partition from [11] by exploring its impact when applied in the
realistic case of automatically extracted object candidates instead
of ground truth masks for the semantic segmentation challenge. As in
[11], we define the Border region as a 5-pixel crown around the
object. In contrast with [11], we define a region pool as the spatial
layout where the local features can be centered, independently of
the extension of the spatial support over which the local
descriptors are computed. Therefore, local descriptors extracted
near the contour of a region can partially describe the neighbouring
region. In this way, we allow the use of the usual 4x4 SIFT
descriptors as well as a multiscale dense feature detector, instead
of the 2x2 SIFT descriptors extracted at one single scale in [11].
Figure 2 shows an example of F-G and F-B-G image partitions.
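
The partition can be sketched as follows for a binary object mask.
This is a minimal example under stated assumptions, not the authors'
code: the helper names `dilate` and `fbg_partition` are hypothetical,
the Border is assumed to lie entirely outside the object, and the
crown width is measured with a 3x3 square structuring element. The
text only fixes the width of the crown (5 pixels).

```python
import numpy as np


def dilate(mask, iterations):
    """Binary dilation with a 3x3 square, repeated `iterations` times."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def fbg_partition(figure_mask, border_width=5):
    """Return a label map: 2 = Figure, 1 = Border, 0 = Ground."""
    figure = figure_mask.astype(bool)
    # Assumed convention: the Border is a crown just outside the object.
    border = dilate(figure, border_width) & ~figure
    labels = np.zeros(figure.shape, dtype=np.uint8)
    labels[border] = 1
    labels[figure] = 2
    return labels
```

Setting `border_width=0` reduces the label map to the classic F-G
partition, which makes the two configurations easy to compare with
the same code path.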

This lack of absolute isolation of the description of each region
pool can be justified in two ways. First, multiple authors have
highlighted the importance of the spatial context around an object
during its recognition [?]. Second, in our experiments, in contrast
with [11], we also use a masked SIFT (MSIFT), which excludes any
visual information coming from the neighbouring region. Therefore,
the learning process can automatically adapt both to classes that
take advantage of the context (giving more importance to non-masked
descriptors) and to those where context can lead to confusion
(giving more importance to masked descriptors).

Fig. 2. Example of a Figure-Ground partition (in the middle) and
a Figure-Border-Ground partition (on the right) of the original image
(on the left).

Fig. 3. Example of a 4-layer crown-based (in the middle) and a
Cartesian-based (on the right) Spatial Pyramid from an object mask
of the original image (on the left).
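
The notion of a region pool defined by the descriptor centers can be
made concrete with a short sketch. It assumes a label map that
encodes Figure, Border and Ground as 2, 1 and 0 (an arbitrary
convention chosen for this example), and the function name
`pool_by_region` and the `pool_fn` callback are hypothetical. Each
descriptor is assigned to the region that contains its center, even
when its spatial support overlaps a neighbouring region, and one
pooled vector per region is concatenated.

```python
import numpy as np


def pool_by_region(labels, centers, descriptors, pool_fn, regions=(2, 1, 0)):
    """Concatenate one pooled vector per region (Figure, Border, Ground).

    labels:      (H, W) integer label map (assumed encoding 2/1/0).
    centers:     (n, 2) integer (row, col) positions of descriptor centers.
    descriptors: (n, d) local descriptors.
    pool_fn:     maps an (m, d) array to a 1-D pooled vector.
    """
    centers = np.asarray(centers)
    descriptors = np.asarray(descriptors, dtype=np.float64)
    # Region membership is decided by the center only.
    region_of = labels[centers[:, 0], centers[:, 1]]
    pooled = []
    for r in regions:
        selected = descriptors[region_of == r]
        if len(selected) == 0:
            # Assumed convention: an empty region is pooled from zeros.
            selected = np.zeros((1, descriptors.shape[1]))
        pooled.append(pool_fn(selected))
    return np.concatenate(pooled)
```

Any pooling function with the same interface can be plugged in, for
instance a plain average (`lambda x: x.mean(axis=0)`) or a
second-order pooling.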

3.2. Contour-based Spatial Pyramid 

In a second contribution inspired by [?], we propose to apply a
Spatial Pyramid (SP) coding approach over the Figure region to also
improve the description of the interior of the object. More
specifically, we apply an SP centered on the object. We have
performed an analysis based on two different spatial configurations:
(i) a 4-layer crown-based SP, and (ii) a Cartesian-based SP. The
layers of the crown-based SP are obtained by applying a distance
transform to the Figure mask. Then, the maximum value is used to
define the different layers on a logarithmic scale. On the other
hand, the Cartesian-based SP divides the Figure region into 4
geometric quadrants which have the center of mass of the region as
origin. Figure 3 shows an example of a 4-layer crown-based SP and a
Cartesian-based SP.
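
Both configurations can be sketched on a binary Figure mask as
follows. Several details are assumptions, since the text does not
fix them: the distance transform is a city-block distance computed
by repeated erosion, pixels outside the image count as background,
and the logarithmic layers are delimited by the thresholds
d_max / 2^k. All function names are hypothetical.

```python
import numpy as np


def distance_to_background(mask):
    """City-block distance from each Figure pixel to the nearest non-Figure pixel."""
    current = mask.astype(bool).copy()
    dist = np.zeros(current.shape, dtype=np.int32)
    while current.any():
        dist += current
        p = np.pad(current, 1, mode="constant", constant_values=False)
        # 4-neighbour erosion: a pixel survives if all its neighbours are set.
        current = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                   & p[1:-1, :-2] & p[1:-1, 2:])
    return dist


def crown_layers(mask, n_layers=4):
    """Label crowns 1 (outermost) to n_layers (innermost); 0 is outside."""
    dist = distance_to_background(mask)
    layers = np.zeros(dist.shape, dtype=np.int32)
    d_max = dist.max()
    if d_max == 0:
        return layers
    rel = dist / d_max
    for k in range(1, n_layers + 1):
        # Assumed logarithmic thresholds: 1/8, 1/4, 1/2, 1 for four layers.
        upper = 0.5 ** (n_layers - k)
        lower = 0.5 ** (n_layers - k + 1) if k > 1 else 0.0
        layers[(rel > lower) & (rel <= upper)] = k
    return layers


def cartesian_quadrants(mask):
    """Label quadrants 1 to 4 around the center of mass; 0 is outside."""
    mask = mask.astype(bool)
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return np.zeros(mask.shape, dtype=np.int64)
    cy, cx = rows.mean(), cols.mean()
    rr, cc = np.indices(mask.shape)
    # 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right.
    quadrants = 1 + (cc >= cx) + 2 * (rr >= cy)
    quadrants[~mask] = 0
    return quadrants
```

With these thresholds the outermost crown is the thinnest one, so
the finest spatial resolution is spent close to the object contour.
A different logarithmic spacing would only change the `upper` and
`lower` values.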

4. EXPERIMENTS 

The Pascal VOC Segmentation challenge [21] provides a benchmark for
semantic segmentation assessment. The evaluation is performed by
means of the Average of the Accuracy per Category (AAC), where the
accuracy for a category c_k is defined as the ratio between the
intersection and the union of the pixels classified as c_k and the
pixels annotated in the ground truth as c_k. The Pascal VOC
Segmentation dataset is divided into three subsets: train,
validation and test. Preliminary experiments are performed using the
train subset for training and the validation subset for testing.
Then, experiments are validated using both the train and validation
subsets for training and the test subset for testing.
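
The metric can be written down in a few lines. The sketch below is
an illustration rather than the official evaluation code: the
function name is hypothetical, pixels labelled 255 in the ground
truth are ignored (the void label used by the Pascal VOC
annotations), and categories that appear in neither the prediction
nor the ground truth are skipped, which is an assumption of this
example.

```python
import numpy as np


def average_accuracy_per_category(pred, gt, n_classes, ignore_label=255):
    """Mean over categories of intersection over union, in percent.

    pred and gt are integer label arrays of the same shape; they can hold
    the pixels of a whole dataset concatenated together.
    """
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    valid = gt != ignore_label
    scores = []
    for k in range(n_classes):
        p = (pred == k) & valid
        g = (gt == k) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            # Category absent from both prediction and ground truth.
            continue
        scores.append(np.logical_and(p, g).sum() / union)
    return 100.0 * float(np.mean(scores))
```

For example, with ground truth `[0, 0, 1, 1]` and prediction
`[0, 1, 1, 1]` the per-category accuracies are 1/2 and 2/3, so the
AAC is about 58.3.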

The experiments have been performed on the Pascal VOC 2011 and 2012
segmentation comp5 challenge, in which no external data can be used
for training. We address the realistic scenario where a ranked list
of pixel-wise object candidates is automatically generated. In our
work, we have considered the regions proposed by CPMC [?], the same
technique adopted in [6], since they allow a fair comparison of
results. However, we have also considered MCG [?], another
state-of-the-art technique for object candidate generation, to check
the consistency of our contributions.

4.1. Results with ideal object candidates

Experiments have been first performed using the ground truth object
masks (ideal object candidates). The use of these masks allows us
to isolate pure recognition effects from segment selection and
inference problems. This way it is possible to assess the
improvements provided by the various spatial codifications in an
ideal scenario.

4.1.1. F-B-G spatial pooling

Table 1 shows the results for different image spatial
representations. The first and third columns correspond to the
configurations from [6], where the Border region is included in the
Ground description. We propose two additional configurations:
(i) Figure(F)-Border(B), and (ii) Figure(F)-Border(B)-Ground(G).

            F [6]    F-B      F-G [6]   F-B-G
eSIFT       63.85    66.24    66.43     68.57
eMSIFT      64.81    68.93    67.59     70.84

Table 1. Gain of introducing the Border for pooling. Results using
GT masks. Training over train11 and evaluation over val11. F refers
to Figure, B refers to Border and G refers to Ground.

On the one hand, the F-B configuration tries to answer the following
question: How important is the entire background in comparison with
the bordering region? When eSIFT descriptors are pooled, using only
the Figure and Border regions and discarding the Ground is almost as
good as using the classical F-G partition of the whole image (66.24
and 66.43 respectively). If eMSIFT descriptors are pooled instead,
the average accuracy achieved by pooling them over F-B is even
better than over F-G (68.93 and 67.59 respectively). This indicates
that the richest contextual information for object recognition is
located in the immediate neighbourhood of the object itself.

On the other hand, the F-B-G configuration aims at showing the
benefits of also including the rest of the background as a region
pool. Although pooling over Border can give better results than
pooling over Ground, as seen before, the Ground description still
carries useful information for object recognition: F-B-G gives the
best result in Table 1 for both eSIFT (68.57) and eMSIFT (70.84).

Once eSIFT and eMSIFT have been independently analyzed, we explore
the joint combination of different descriptors by concatenation.
This study is performed to assess the impact of our proposal on the
configuration with the best results obtained in [6]: eSIFT-F,
eSIFT-G, eMSIFT-F and eLBP-F (72.98). Analogously, using only eSIFT
and eMSIFT descriptors and the proposal of partitioning the image
into three regions (F-B-G) improves the average accuracy up to 73.84
(see Table 3) with respect to the 72.48 obtained in [6] (eSIFT and
eMSIFT over F-G).

4.1.2. Contour-based Spatial Pyramid

In this section, we explore the proposal of improving the visual
description by using the contour-based SP presented in Section 3.2.
Table 2 shows the results of applying the two Spatial Pyramid
configurations (crown-based and Cartesian-based) over the Figure
region for the eMSIFT descriptors. The results show that both types
of SP give a significant improvement of the average classification
accuracy, especially when only the Figure region is considered.
Although the crown-based SP is better than the Cartesian-based SP
for the Figure region, the Cartesian-based SP gives the best
performance when the Border and Ground regions are also considered.
We believe that this behavior is caused by the fact that the
description of the Border region is more diverse with respect to the
geometric quadrants than with respect to the outermost layer of the
crown-based SP.

                      F           F-B      F-B-G
non SP                64.81 [6]   68.93    70.84
crown-based SP        68.67       71.05    71.69
Cartesian-based SP    67.66       71.64    72.68

Table 2. Comparison between not using an SP for the Figure region
and the crown-based and Cartesian-based SP approaches for GT masks.
Training over train11 and evaluation over val11.

Figure        SP(F)   Border    Ground    AAC
eS+eMS+eL     -       -         eS        72.98 [6]
eS+eMS        -       eMS+eS    eMS+eS    73.84
eS+eMS+eL     eMS     eMS+eS    eMS+eS    75.86

Table 3. Gain of introducing the Border for pooling, applying the
Cartesian-based Spatial Pyramid over the Figure (SP(F)) and
combining eSIFT (eS), eMSIFT (eMS) and eLBP (eL). Results using GT
masks. Training over train11 and evaluation over val11.

The performance achieved by using only the eMSIFT descriptor
(72.68) is almost as good as the accuracy achieved in [6] by
combining eMSIFT, eSIFT and eLBP (72.98). Table 3 explores the joint
combination of different descriptors by concatenation when both
Figure-Border-Ground spatial pooling and Cartesian-based Spatial
Pyramid are applied. As shown in this table, the use of both
approaches improves the average accuracy up to 75.86.

4.2. Results with CPMC Object Candidates 

In this section, we evaluate our two main contributions over CPMC
object candidates. Note that there is a tight link between CPMC and
the O2P-based architecture from [6], since these object candidates
have been reranked and filtered based on the same features used for
classification, i.e. O2P features.

4.2.1. F-B-G spatial pooling 

First, the experiments have been carried out on Pascal VOC 2011
using the train subset for training and the validation subset for
evaluation. Partitioning the image for each object candidate into
the Figure, Border and Ground regions improves the performance up to
34.81 (with eSIFT), in comparison with the original partitioning
into Figure and Ground regions (28.58 [6]).

Next, we have performed experiments pooling the three different
descriptors (eSIFT, eMSIFT and eLBP) over the three proposed
regions. The original performance achieved in [6] is 37.15. Our
results from Table 4 show that using the partitioning of the image
into three regions for pooling the descriptors increases the average
accuracy up to 38.91, which represents an increase of 1.76 points.

Figure               Border   Ground   AAC
eSIFT+eMSIFT+eLBP    -        eSIFT    37.15 [6]
eSIFT+eMSIFT+eLBP    eSIFT    eSIFT    38.91

Table 4. Introducing the Border region with CPMC object candidates.
Training over train11 and evaluation over val11.

For comp5, the experiments have been carried out using only the
segmentation annotations available for the train and val sets of
the segmentation challenge, discarding the bounding box annotations
of the detection challenge. The comparison between F-G and F-B-G
poolings is shown in Table 5 for both Pascal VOC 2011 and 2012. The
partitioning of the image into three regions (F-B-G) gives the best
performance, improving the average classification accuracy by 5.0
and 2.3 points with respect to the F-G pooling for VOC 2011 and VOC
2012 respectively. Notice that other results given by
state-of-the-art techniques [22, 23] have been obtained by using the
bounding box annotations from the detection challenge, which is out
of the scope of this paper. Analyzing the results by category, the
F-B-G image partitioning improves the classification accuracy in 17
out of 20 categories in VOC 2011. In VOC 2012, the F-B-G approach
improves the accuracy in 13 out of 20 categories.

            F-G [6]   F-B-G
VOC 2011    38.8      43.8
VOC 2012    39.9      42.2

Table 5. Results using CPMC object candidates for comp5 2011 and
2012 and different image representations: F-G and F-B-G.
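
The absolute gains in Table 5 can also be restated as relative
gains. The short sketch below does this arithmetic; the helper name
`relative_gain` is introduced only for this example. Reading the
12.9% quoted in the Introduction as the relative gain for VOC 2011
is an inference from the numbers, since the text does not state how
that figure was computed.

```python
def relative_gain(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# AAC values copied from Table 5 (F-G baseline, F-B-G proposal).
for name, fg, fbg in [("VOC 2011", 38.8, 43.8), ("VOC 2012", 39.9, 42.2)]:
    gain = relative_gain(fg, fbg)
    print(f"{name}: +{fbg - fg:.1f} points, {gain:.1f}% relative")
```

With the values of Table 5 this gives a relative gain of 12.9% for
VOC 2011 and 5.8% for VOC 2012.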

4.2.2. Contour-based Spatial Pyramid 

Once the partitioning of the image into three regions has been
validated for CPMC object candidates, we proceed to validate the use
of the Spatial Pyramid over the Figure region. As before, the
experiments are first evaluated over the validation subset. Using
the Cartesian-based SP over the Figure region with the eSIFT
descriptor and ignoring both the Border and Ground regions increases
the performance up to 34.56, which is close to the improvement also
achieved by the partitioning of the image into three regions (34.81).

Applying both proposals, i.e. the Cartesian-based SP over the
Figure region and the F-B-G pooling, results in an average accuracy
of 37.38. Notice that this result has been achieved using only
eSIFT, whereas the best performance achieved in [6] is 37.15, which
uses a combination of eSIFT, eMSIFT and eLBP. An average accuracy of
39.62 is achieved when the three descriptors are combined with the
use of the three regions and the Cartesian-based SP (see Table 6).

Figure        SP(F)   Border   Ground   AAC
eS+eMS+eL     -       -        eS       37.15 [6]
eS            eS      eS       eS       37.38
eS+eMS        eS      eS       eS       39.21
eS+eMS+eL     eS      eS       eS       39.62

Table 6. Results using CPMC object candidates for different image
spatial representations, combining eSIFT (eS), eMSIFT (eMS) and
eLBP (eL) and applying the Cartesian-based Spatial Pyramid over
Figure. Training over train11 and evaluation over val11.

For comp5, adding the Cartesian-based SP over the Figure region
decreases the performance by 3.5 points for VOC 2011 (40.3) and by
1.4 points for VOC 2012 (40.8). This decrease was not expected based
on the tendency shown in the previous experiments using the train
set for training and the val set for evaluation for both ground
truth object masks and CPMC object candidates. The use of the SP
over the Figure region only improves the accuracy in 4 categories in
VOC 2011 and in 8 categories in VOC 2012.

4.3. Results with MCG Object Candidates 

Our spatial pooling approach has also been checked with another
state-of-the-art object candidate generation technique: Multiscale
Combinatorial Grouping (MCG) [?]. When the baseline solution given
by [6], based on O2P features pooled over Figure-Ground, is applied
over MCGs instead of CPMCs, the average accuracy drops to 30.88 with
respect to the 37.15 achieved with CPMCs.

This drop in the performance seems to be in contradiction with the
results reported in [?], where for the 150 top-ranked object
candidates both techniques give a similar performance for
segmentation (without considering recognition). We believe that such
a difference in the performance regarding the semantic segmentation
is due to the fact that CPMCs have been specifically reranked for
the O2P-based architecture proposed in [6]. Although about 800 CPMC
generic object candidates per image are extracted and ranked based
on mid-level descriptors and Gestalt features, a linear regressor
also based on the O2P features is learned to rerank and filter them
to generate the final pool of up to 150 CPMCs used in [6].
Therefore, the features used for classification (O2P) are also used
for CPMC selection. On the other hand, MCG object candidates are
ranked based only on mid-level descriptors and Gestalt features.

However, we have also checked our spatial pooling proposals 
over the 150 top-ranked MCG object candidates. The F-B-G spatial 
pooling increases the performance up to 34.09, which represents a 
gain of 3.21 points with respect to the F-G spatial pooling (30.88). 
For such a spatial pooling, the classification accuracy is improved 
for 15 out of 20 categories. 

Furthermore, when the Cartesian-based SP is applied over the 
Figure region besides using the F-B-G spatial pooling, the accuracy 
is increased up to 36.10, a gain of 2.01 points with respect to the F- 
B-G pooling (34.09) and 5.22 points with respect to the F-G pooling 
(30.88). Applying the Cartesian-based SP improves the accuracy for 
16 out of 20 categories with respect to the F-B-G pooling and for 19 
out of 20 categories with respect to the original F-G pooling. 

Although the results given by MCGs are worse than the ones
achieved with CPMCs, we consider that these experiments illustrate
the robustness of our spatial pooling contributions with object
candidates for semantic segmentation.

5. CONCLUSIONS 

We have presented two contributions for improving the spatial
pooling beyond the classic Figure-Ground partitioning to solve the
semantic segmentation problem.

On the one hand, we have extended the original idea from [11],
where a Figure-Border-Ground spatial pooling is applied in an ideal
situation, to a realistic scenario with the use of object
candidates. This richer spatial pooling has been tested with
state-of-the-art techniques (CPMC and MCG object candidates and O2P
features), leading to improvements of the average accuracy in all
scenarios.

On the other hand, we have explored two different configurations
(crown-based and Cartesian-based) of Spatial Pyramid applied over
the Figure region. Although this richer spatial pooling increased
the performance when the system was evaluated over the validation
subset, this trend was not observed when it was eventually assessed
over the test subset.

Further visual results can be found on our project site.